feat: Support Spark Expression Encode #4315

Open

YutaLin wants to merge 14 commits into apache:main from YutaLin:3183_support_spark_expression_encode

Conversation

@YutaLin (Contributor) commented May 13, 2026

Which issue does this PR close?

Closes #3183

Rationale for this change

Support the Spark expression Encode

What changes are included in this PR?

  • Add StringEncode in string serde
  • Update shims in spark3.4/3.5/4.0/4.1/4.2 to catch Encode

How are these changes tested?

Added encode.sql and ran it on Spark 3.4/3.5/4.0

Comment thread spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala Outdated
Comment thread spark/src/main/scala/org/apache/comet/serde/strings.scala Outdated
@andygrove (Member)

Thanks @YutaLin. LGTM overall. Could you address the feedback? Then I'll kick off CI.

@YutaLin (Contributor, Author) commented May 13, 2026

Hi @andygrove, thanks for the review!
I've extracted the encode method and added a null check.

About "Spark accepts utf8 as an alias for UTF-8": Spark only supports aliases through 3.5, because it uses the JDK's Charset.forName. Starting with 4.0 it has a whitelist check, so it doesn't support aliases. I'd suggest we keep only utf-8 for now, WDYT?

https://spark.apache.org/docs/4.0.0/sql-migration-guide.html#upgrading-from-spark-sql-35-to-40

Since Spark 4.0, the encode() and decode() functions support only the following charsets ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’, ‘UTF-32’. To restore the previous behavior when the function accepts charsets of the current JDK used by Spark, set spark.sql.legacy.javaCharsets to true.
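For reference, the behavior difference described above can be reproduced directly against the JDK. This is an illustrative Java sketch (not code from the PR); the whitelist set and the uppercase comparison are assumptions modeled on the migration-guide text, not Spark's actual implementation:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Set;

public class CharsetAliasDemo {
    // Charsets allowed by Spark 4.0's encode()/decode() per the migration guide.
    static final Set<String> SPARK4_WHITELIST = Set.of(
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16", "UTF-32");

    public static void main(String[] args) {
        // Spark <= 3.5 delegates to the JDK, which resolves registered aliases:
        Charset cs = Charset.forName("utf8");
        System.out.println(cs.name());                         // UTF-8
        System.out.println(cs.equals(StandardCharsets.UTF_8)); // true

        // Spark 4.0 checks the name against a fixed list instead, so the
        // alias "utf8" is rejected (assuming a case-insensitive comparison):
        System.out.println(SPARK4_WHITELIST.contains("utf8".toUpperCase()));  // false
        System.out.println(SPARK4_WHITELIST.contains("utf-8".toUpperCase())); // true
    }
}
```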

@YutaLin YutaLin requested a review from andygrove May 13, 2026 22:17
@coderfender (Contributor) left a comment


Left some minor comments but overall looks good @YutaLin

binding: Boolean): Option[Expr] = {
charset match {
case Literal(str, DataTypes.StringType)
if str != null && str.toString.toLowerCase(Locale.ROOT) == "utf-8" =>

This seems like a fragile check. Perhaps a better check would be to use StandardCharsets?
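One way the suggestion could look, sketched in Java since the underlying JDK API is the same from Scala (the isUtf8 helper is hypothetical, not from the PR): let Charset.forName resolve the name and compare the result against StandardCharsets.UTF_8 instead of lowercasing the string.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class Utf8Check {
    /** Returns true iff the given name resolves to UTF-8 (hypothetical helper). */
    static boolean isUtf8(String charsetName) {
        try {
            // Charset.forName understands canonical names and registered
            // aliases ("utf8", "UTF8", ...), case-insensitively.
            return Charset.forName(charsetName).equals(StandardCharsets.UTF_8);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUtf8("utf-8"));  // true
        System.out.println(isUtf8("UTF8"));   // true
        System.out.println(isUtf8("latin1")); // false (resolves to ISO-8859-1)
    }
}
```

Note that this is more permissive than Spark 4.0's whitelist (it accepts aliases like utf8), so whichever check is adopted would need to match the target Spark version's semantics, as discussed earlier in the thread.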

import org.apache.comet.serde.CommonStringExprs
import org.apache.comet.serde.ExprOuterClass.Expr

trait ShimCometExprs extends CommonStringExprs {

Seems like the trait name is different from the Comet<> convention?

INSERT INTO test_encode_utf8 VALUES ('hello'), ('world'), (''), ('café'), (NULL)

query
SELECT encode(s, 'utf-8') FROM test_encode_utf8

I believe Spark accepts any variant of utf-8 (utf8, UTF8, etc.). Might be a good idea to add those tests?

BooleanType) =>
val Seq(value, charset, _, _) = s.arguments
stringEncode(expr, charset, value, inputs, binding)


Seems like we missed removing this piece?



Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark expression: encode

3 participants